CS 329E Data Visualization - DV_FinalProject

For our final project we incorpate multiple data sets from the New York State official census database (focusing on crime rates and student obesity) and produce interesting visualizations/data operations using the skills we have learned this semester.

Authors: Bruno De Hoyos (bd8739), Estevan Garza (eg22453)

Importing data

We begin by importing some useful libraries:

source("../01 Data/lib.R", max.deparse.length=1e3)
## Loading required package: scatterplot3d
## Loading required package: rgl
## Loading required package: RColorBrewer

We now begin building our data frames. We build the data frames through the data.frame() function, and pass to it a URL to an OracleCloud server with an appropriate SQL query. In this case we query the tables “TYPE_CRIMES_BY_COUNTY”, “SENTENCES_TO_PROBATION_BY_YEAR”,“ADULT_ARRESTS_BY_COUNTY”, and “weight3” which we have already imported into the Oracle database, and and rename the dataframes ‘dft’, ‘dfs’, and ‘dfr’ respectively.

source("../01 Data/CrimeDataFrames.R", echo=TRUE, max.deparse.length=1e3)
## 
## > dft <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from TYPE_CRIMES_BY_COUNTY\""), 
## +     httpheader = c(DB = "jdbc:oracle:thin:@129.152.144.84:1521:ORCL", 
## +         USER = "C##cs329e_eg22453", PASS = "orcl_eg22453", MODE = "native_mode", 
## +         MODEL = "model", returnDimensions = "False", returnFor = "JSON"), 
## +     verbose = TRUE)))
## 
## > dfs <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from SENTENCES_TO_PROBATION_BY_YEAR\""), 
## +     httpheader = c(DB = "jdbc:oracle:thin:@129.152.144.84:1521:ORCL", 
## +         USER = "C##cs329e_eg22453", PASS = "orcl_eg22453", MODE = "native_mode", 
## +         MODEL = "model", returnDimensions = "False", returnFor = "JSON"), 
## +     verbose = TRUE)))
## 
## > dfr <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from ADULT_ARRESTS_BY_COUNTY\""), 
## +     httpheader = c(DB = "jdbc:oracle:thin:@129.152.144.84:1521:ORCL", 
## +         USER = "C##cs329e_eg22453", PASS = "orcl_eg22453", MODE = "native_mode", 
## +         MODEL = "model", returnDimensions = "False", returnFor = "JSON"), 
## +     verbose = TRUE)))
## 
## > dfr2 <- dfr %>% filter(COUNTY != "Unknown-NonNYS", 
## +     COUNTY != "Unknown NYC county")
## 
## > dfs2 <- dfs %>% filter(CONVICTION == "Felony")
source("../01 Data/StudentWeightDataFrame.R", echo=TRUE, max.deparse.length=1e3)
## 
## > student_weight <- data.frame(eval(parse(text = substring(getURL(URLencode("http://129.152.144.84:5001/rest/native/?query=\"select * from weight3\""), 
## +     httpheader = c(DB = "jdbc:oracle:thin:@129.152.144.84:1521:ORCL", 
## +         USER = "C##cs329e_bd8739", PASS = "orcl_bd8739", MODE = "native_mode", 
## +         MODEL = "model", returnFor = "R", returnDimensions = "False"), 
## +     verbose = TRUE), 1, 2^31 - 1))))
## 
## > student_weight[, 1] <- as.numeric(as.character(student_weight[, 
## +     1]))
## 
## > student_weight[, 6] <- as.numeric(as.character(student_weight[, 
## +     6]))
## 
## > student_weight[, 7] <- as.numeric(as.character(student_weight[, 
## +     7]))
## 
## > student_weight[, 8] <- as.numeric(as.character(student_weight[, 
## +     8]))
## 
## > student_weight[, 9] <- as.numeric(as.character(student_weight[, 
## +     9]))
## 
## > student_weight[, 10] <- as.numeric(as.character(student_weight[, 
## +     10]))
## 
## > student_weight[, 11] <- as.numeric(as.character(student_weight[, 
## +     11]))
## 
## > student_weight[, 17] <- as.numeric(as.character(student_weight[, 
## +     17]))
## 
## > student_weight[, 18] <- as.numeric(as.character(student_weight[, 
## +     18]))
## 
## > student_weight[, 19] <- as.numeric(as.character(student_weight[, 
## +     19]))

We then display the ‘head’ of the data frames to ensure all the data was loaded in properly.

source("../01 Data/DataFramShow.R", echo=TRUE, max.deparse.length=1e3)
## 
## > tbl_df(dft)
## Source: local data frame [1,488 x 11]
## 
##       COUNTY YEAR POPULATION INDEX_COUNT INDEX_RATE VIOLENT_COUNT
## 1       Erie 2009     905273       32656     3607.3          4915
## 2      Essex 2009      36495         519     1422.1            49
## 3   Franklin 2009      50448         937     1857.4            65
## 4     Fulton 2009      54912        1522     2771.7           112
## 5    Genesee 2009      57420        1497     2607.1            92
## 6     Greene 2009      49071         853     1738.3           102
## 7   Hamilton 2009       4980          51     1024.1             2
## 8   Herkimer 2009      61943        1369     2210.1           168
## 9  Jefferson 2009     118766        2701     2274.2           224
## 10     Kings 2009    2567968       55816     2173.5         15803
## ..       ...  ...        ...         ...        ...           ...
## Variables not shown: VIOLENT_RATE (dbl), PROPERTY_COUNT (int),
##   PROPERTY_RATE (dbl), FIREARM_COUNT (fctr), FIREARM_RATE (fctr)
## 
## > tbl_df(dfs2)
## Source: local data frame [2,728 x 9]
## 
##         COUNTY YEAR CONVICTION VIOLENT_FELONY_OFFENSES DRUGS DWI PROPERTY
## 1      Orleans 2011     Felony                       0    12   2        6
## 2       Oswego 2011     Felony                       3     6  25       31
## 3       Otsego 2011     Felony                       0     2  14        9
## 4       Putnam 2011     Felony                       4     3  20        8
## 5       Queens 2011     Felony                     256    64  67      108
## 6   Rensselaer 2011     Felony                      10    14  34        9
## 7     Richmond 2011     Felony                      38    31   3       22
## 8     Rockland 2011     Felony                      15    40  40       48
## 9     Saratoga 2011     Felony                      10    15  74       18
## 10 Schenectady 2011     Felony                      14    13  32       18
## ..         ...  ...        ...                     ...   ... ...      ...
## Variables not shown: OTHER (int), TOTAL (int)
## 
## > tbl_df(dfr2)
## Source: local data frame [2,728 x 13]
## 
##          COUNTY YEAR TOTAL FELONY_TOTAL DRUG_FELONY VIOLENT_FELONY
## 1      Schuyler 2006   380           81          12             11
## 2        Seneca 2006   760          187          22             31
## 3       Steuben 2006  2156          625          96            134
## 4  St. Lawrence 2006  2843          768         125            141
## 5       Suffolk 2006 26353         7523        1424           1770
## 6      Sullivan 2006  2161          749         141            179
## 7         Tioga 2006   813          222          30             38
## 8      Tompkins 2006  1578          344          46             97
## 9        Ulster 2006  5024         1249         230            234
## 10       Warren 2006  2325          616          48            121
## ..          ...  ...   ...          ...         ...            ...
## Variables not shown: DWI_FELONY (int), OTHER_FELONY (int),
##   MISDEMEANOR_TOTAL (int), DRUG_MISD (int), DWI_MISD (int), PROPERTY_MISD
##   (int), OTHER_MISD (int)
## 
## > tbl_df(student_weight)
## Source: local data frame [3,270 x 19]
## 
##    LOCATION_CODE      COUNTY                               AREA_NAME
## 1          30701      BROOME CHENANGO VALLEY CENTRAL SCHOOL DISTRICT
## 2          31701      BROOME         WINDSOR CENTRAL SCHOOL DISTRICT
## 3          31701      BROOME         WINDSOR CENTRAL SCHOOL DISTRICT
## 4          31701      BROOME         WINDSOR CENTRAL SCHOOL DISTRICT
## 5          40302 CATTARAUGUS                  ALLEGANY-LIMESTONE CSD
## 6          40302 CATTARAUGUS                  ALLEGANY-LIMESTONE CSD
## 7          40302 CATTARAUGUS                  ALLEGANY-LIMESTONE CSD
## 8          40901 CATTARAUGUS   ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
## 9          40901 CATTARAUGUS   ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
## 10         40901 CATTARAUGUS   ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
## ..           ...         ...                                     ...
## Variables not shown: REGION (fctr), SCHOOL_YEARS (fctr), NO_OVERWEIGHT
##   (dbl), PCT_OVERWEIGHT (dbl), NO_OBESE (dbl), PCT_OBESE (dbl),
##   NO_OVERWEIGHT_OR_OBESE (dbl), PCT_OVERWEIGHT_OR_OBESE (dbl), GRADE_LEVEL
##   (fctr), AREA_TYPE (fctr), STREET_ADDRESS (fctr), CITY (fctr), STATE
##   (fctr), ZIP_CODE (dbl), LAT (dbl), LON (dbl)

Joining data frames:

Now that we have the three tables loaded into Rstudio, we can now join the tables. We join all three tables by first joining any two tables based on the commmon ‘COUNTY’ and ‘YEAR’ columns, placing that table into a temporary data frame, and then joining the temporary data frame with the third table. We do this operation twice, once for a full-join, and another time for an inner-join. We also make use of the complete.cases function for the final tables in order to remove any NA values in rows. We also convert the types of some of the columns to the numerical format, using the as.numeric() function, to make faciliate plotting this data.

source("../01 Data/CrimeDataFrameJoin.R", echo=TRUE, max.deparse.length=1e3)
## 
## > temp <- full_join(dfs2, dfr2, by = c("COUNTY", "YEAR"))
## 
## > full <- full_join(temp, dft, by = c("COUNTY", "YEAR"))
## 
## > cleantbl <- full[complete.cases(full), ]
## 
## > cleantbl[, 4] <- as.numeric(as.character(cleantbl[, 
## +     4]))
## 
## > cleantbl[, 5] <- as.numeric(as.character(cleantbl[, 
## +     5]))
## 
## > cleantbl[, 6] <- as.numeric(as.character(cleantbl[, 
## +     6]))
## 
## > cleantbl[, 7] <- as.numeric(as.character(cleantbl[, 
## +     7]))
## 
## > cleantbl[, 8] <- as.numeric(as.character(cleantbl[, 
## +     8]))
## 
## > cleantbl[, 9] <- as.numeric(as.character(cleantbl[, 
## +     9]))
## 
## > cleantbl[, 10] <- as.numeric(as.character(cleantbl[, 
## +     10]))
## 
## > cleantbl[, 11] <- as.numeric(as.character(cleantbl[, 
## +     11]))
## 
## > cleantbl[, 12] <- as.numeric(as.character(cleantbl[, 
## +     12]))
## 
## > cleantbl[, 13] <- as.numeric(as.character(cleantbl[, 
## +     13]))
## 
## > cleantbl[, 14] <- as.numeric(as.character(cleantbl[, 
## +     14]))
## 
## > cleantbl[, 15] <- as.numeric(as.character(cleantbl[, 
## +     15]))
## 
## > cleantbl[, 16] <- as.numeric(as.character(cleantbl[, 
## +     16]))
## 
## > cleantbl[, 17] <- as.numeric(as.character(cleantbl[, 
## +     17]))
## 
## > cleantbl[, 18] <- as.numeric(as.character(cleantbl[, 
## +     18]))
## 
## > cleantbl[, 19] <- as.numeric(as.character(cleantbl[, 
## +     19]))
## 
## > cleantbl[, 20] <- as.numeric(as.character(cleantbl[, 
## +     20]))
## 
## > cleantbl[, 21] <- as.numeric(as.character(cleantbl[, 
## +     21]))
## 
## > cleantbl[, 22] <- as.numeric(as.character(cleantbl[, 
## +     22]))
## 
## > cleantbl[, 23] <- as.numeric(as.character(cleantbl[, 
## +     23]))
## 
## > cleantbl[, 24] <- as.numeric(as.character(cleantbl[, 
## +     24]))
## 
## > cleantbl[, 25] <- as.numeric(as.character(cleantbl[, 
## +     25]))
## 
## > cleantbl[, 26] <- as.numeric(as.character(cleantbl[, 
## +     26]))
## 
## > cleantbl[, 27] <- as.numeric(as.character(cleantbl[, 
## +     27]))
## 
## > cleantbl[, 28] <- as.numeric(as.character(cleantbl[, 
## +     28]))
## 
## > cleantbl[, 29] <- as.numeric(as.character(cleantbl[, 
## +     29]))
## 
## > cleantbl <- cleantbl[complete.cases(cleantbl), ]
## 
## > head(cleantbl)
##       COUNTY YEAR CONVICTION VIOLENT_FELONY_OFFENSES DRUGS DWI PROPERTY
## 1    Orleans 2011     Felony                       0    12   2        6
## 2     Oswego 2011     Felony                       3     6  25       31
## 3     Otsego 2011     Felony                       0     2  14        9
## 4     Putnam 2011     Felony                       4     3  20        8
## 6 Rensselaer 2011     Felony                      10    14  34        9
## 8   Rockland 2011     Felony                      15    40  40       48
##   OTHER TOTAL.x TOTAL.y FELONY_TOTAL DRUG_FELONY VIOLENT_FELONY DWI_FELONY
## 1     5      25     858          249          47             65         22
## 2    10      75    3180          893          94            208         91
## 3     1      26    1546          427          76             82         44
## 4     7      42    1560          372          98             66         43
## 6    26      93    3485         1094         167            255         79
## 8    46     189    4413         1420         274            367        111
##   OTHER_FELONY MISDEMEANOR_TOTAL DRUG_MISD DWI_MISD PROPERTY_MISD
## 1          115               609        42      107           214
## 2          500              2287       186      472           728
## 3          225              1119        89      291           335
## 4          165              1188       189      451           178
## 6          593              2391       210      425           782
## 8          668              2993       252      630           919
##   OTHER_MISD POPULATION INDEX_COUNT INDEX_RATE VIOLENT_COUNT VIOLENT_RATE
## 1        246      43076        1004     2330.8            82        190.4
## 2        901     122658        3206     2613.8           181        147.6
## 3        404      62539        1144     1829.3           119        190.3
## 4        370     100158         860      858.6            66         65.9
## 6        974     160146        4682     2923.6           485        302.8
## 8       1192     313088        4564     1457.7           553        176.6
##   PROPERTY_COUNT PROPERTY_RATE FIREARM_COUNT FIREARM_RATE
## 1            922        2140.4             5         11.6
## 2           3025        2466.2            17         13.9
## 3           1025        1639.0            18         28.8
## 4            794         792.7             1          1.0
## 6           4197        2620.7           112         69.9
## 8           4011        1281.1            44         14.1
## 
## > inner_temp <- inner_join(dfs2, dfr2, by = c("COUNTY", 
## +     "YEAR"))
## 
## > inner <- inner_join(inner_temp, dft, by = c("COUNTY", 
## +     "YEAR"))
## 
## > cleaninner <- inner[complete.cases(inner), ]
## 
## > cleaninner[, 4] <- as.numeric(as.character(cleaninner[, 
## +     4]))
## 
## > cleaninner[, 5] <- as.numeric(as.character(cleaninner[, 
## +     5]))
## 
## > cleaninner[, 6] <- as.numeric(as.character(cleaninner[, 
## +     6]))
## 
## > cleaninner[, 7] <- as.numeric(as.character(cleaninner[, 
## +     7]))
## 
## > cleaninner[, 8] <- as.numeric(as.character(cleaninner[, 
## +     8]))
## 
## > cleaninner[, 9] <- as.numeric(as.character(cleaninner[, 
## +     9]))
## 
## > cleaninner[, 10] <- as.numeric(as.character(cleaninner[, 
## +     10]))
## 
## > cleaninner[, 11] <- as.numeric(as.character(cleaninner[, 
## +     11]))
## 
## > cleaninner[, 12] <- as.numeric(as.character(cleaninner[, 
## +     12]))
## 
## > cleaninner[, 13] <- as.numeric(as.character(cleaninner[, 
## +     13]))
## 
## > cleaninner[, 14] <- as.numeric(as.character(cleaninner[, 
## +     14]))
## 
## > cleaninner[, 15] <- as.numeric(as.character(cleaninner[, 
## +     15]))
## 
## > cleaninner[, 16] <- as.numeric(as.character(cleaninner[, 
## +     16]))
## 
## > cleaninner[, 17] <- as.numeric(as.character(cleaninner[, 
## +     17]))
## 
## > cleaninner[, 18] <- as.numeric(as.character(cleaninner[, 
## +     18]))
## 
## > cleaninner[, 19] <- as.numeric(as.character(cleaninner[, 
## +     19]))
## 
## > cleaninner[, 20] <- as.numeric(as.character(cleaninner[, 
## +     20]))
## 
## > cleaninner[, 21] <- as.numeric(as.character(cleaninner[, 
## +     21]))
## 
## > cleaninner[, 22] <- as.numeric(as.character(cleaninner[, 
## +     22]))
## 
## > cleaninner[, 23] <- as.numeric(as.character(cleaninner[, 
## +     23]))
## 
## > cleaninner[, 24] <- as.numeric(as.character(cleaninner[, 
## +     24]))
## 
## > cleaninner[, 25] <- as.numeric(as.character(cleaninner[, 
## +     25]))
## 
## > cleaninner[, 26] <- as.numeric(as.character(cleaninner[, 
## +     26]))
## 
## > cleaninner[, 27] <- as.numeric(as.character(cleaninner[, 
## +     27]))
## 
## > cleaninner[, 28] <- as.numeric(as.character(cleaninner[, 
## +     28]))
## 
## > cleaninner[, 29] <- as.numeric(as.character(cleaninner[, 
## +     29]))
## 
## > cleaninner <- inner[complete.cases(inner), ]

Data Wrangling:

The next step in the project is to start doing some data wrangling. The first wrangling procedure involves filtering out the top 50th percentile of the data with respect to DWI misdemeanors.

source("../02 Data Wrangling/Wrangle1.R", echo = TRUE, max.deparse.length=1e3)
## 
## > wr1 <- cleantbl %>% mutate(misd_pct = cume_dist(DWI_MISD)) %>% 
## +     filter(misd_pct <= 0.5)

The second wrangling procedure makes use of the mutate function and adds a new column to the joined tables representing the sum of Property, Violence, and Firearm crime rates in each New York county.

source("../02 Data Wrangling/Wrangle2.R", echo = TRUE, max.deparse.length=1e3)
## 
## > wr2 <- cleantbl %>% mutate(prop_vio_fire = PROPERTY_RATE + 
## +     VIOLENT_RATE + FIREARM_RATE)

The next and most complex wrangling procedure involves selecting various crime related columns (which including different variations of felony and misdemeanor crimes), groups them by year, and summarises the mean for each crime column. We utilize the melt function introduced at the beginning of the semester in order to stack the respective column data on top of each other, which all share the YEAR column.

source("../02 Data Wrangling/Wrangle3.R", echo = TRUE, max.deparse.length=1e3)
## 
## > wr3 <- cleantbl %>% select(YEAR, DRUGS, DWI, PROPERTY, 
## +     DRUG_FELONY, VIOLENT_FELONY, DWI_FELONY, DRUG_MISD, DWI_MISD, 
## +     PROPERTY_MISD, VIOLENT_COUNT, PROPERTY_COUNT, FIREARM_COUNT) %>% 
## +     group_by(YEAR) %>% summarise(`Drug Probation` = mean(DRUGS), 
## +     `DWI Probation` = mean(DWI), `Property Probation` = mean(PROPERTY), 
## +     `Drug Felonies` = mean(DRUG_FELONY), `Violent Felonies` = mean(VIOLENT_FELONY), 
## +     `DWI Felonies` = mean(DWI_FELONY), `Drug Misdemeanors` = mean(DRUG_MISD), 
## +     `DWI Misdemeanors` = mean(DWI_MISD), `Property Misdemeanors` = mean(PROPERTY_MISD), 
## +     `Violent Crime Count` = mean(VIOLENT_COUNT), `Property Crime Count` = mean(PROPERTY_COUNT), 
## +     `Firearm Crime Count` = mean(FIREARM_COUNT)) %>% melt(., 
## +     id.vars = "YEAR", measure.vars = c("Drug Probation", "DWI Probation", 
## +         "Property Probation", "Drug Felonies", "Violent Felonies", 
## +         "DWI Felonies", "Drug Misdemeanors", "DWI Misdemeanors", 
## +         "Property M ..." ... [TRUNCATED]

Now we implement the first of our data wrangling for the student obesity table and through use of the pipe operators, filter, mutate and cumulative distribution functions, produce a new data frame with a new variable, “owobpct”. This variable ranks each individual school district level (not the DISTRICT_TOTAL numbers however which have been filtered out using the != operator) in orders of percentiles concerning overweight OR obese students.

source("../02 Data Wrangling/student_wrangle1.R", echo = TRUE, max.deparse.length=1e3)
## 
## > st1 <- student_weight %>% mutate(ow_ob_pct = cume_dist(NO_OVERWEIGHT_OR_OBESE)) %>% 
## +     filter(ow_ob_pct <= 0.9, GRADE_LEVEL != "DISTRICT TOTAL")
## 
## > head(st1)
##   LOCATION_CODE      COUNTY                               AREA_NAME
## 1         30701      BROOME CHENANGO VALLEY CENTRAL SCHOOL DISTRICT
## 2         31701      BROOME         WINDSOR CENTRAL SCHOOL DISTRICT
## 3         31701      BROOME         WINDSOR CENTRAL SCHOOL DISTRICT
## 4         40302 CATTARAUGUS                  ALLEGANY-LIMESTONE CSD
## 5         40302 CATTARAUGUS                  ALLEGANY-LIMESTONE CSD
## 6         40901 CATTARAUGUS   ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
##             REGION SCHOOL_YEARS NO_OVERWEIGHT PCT_OVERWEIGHT NO_OBESE
## 1  NEW YORK - PENN    2012-2013            42          0.170       58
## 2  NEW YORK - PENN    2012-2013            54          0.139       65
## 3  NEW YORK - PENN    2012-2013            44          0.175       61
## 4 WESTERN NEW YORK    2012-2013            69          0.174       56
## 5 WESTERN NEW YORK    2012-2013            25          0.153       35
## 6 WESTERN NEW YORK    2012-2013            30          0.203       23
##   PCT_OBESE NO_OVERWEIGHT_OR_OBESE PCT_OVERWEIGHT_OR_OBESE GRADE_LEVEL
## 1     0.235                    100                    0.41 MIDDLE/HIGH
## 2     0.167                    119                    0.31  ELEMENTARY
## 3     0.242                    105                    0.42 MIDDLE/HIGH
## 4     0.141                    125                    0.32  ELEMENTARY
## 5     0.215                     60                    0.37 MIDDLE/HIGH
## 6     0.155                     53                    0.36  ELEMENTARY
##         AREA_TYPE         STREET_ADDRESS          CITY STATE ZIP_CODE
## 1 SCHOOL DISTRICT 221 CHENANGO BRIDGE RD    BINGHAMTON    NY    13901
## 2 SCHOOL DISTRICT          1191 NY RT 79       WINDSOR    NY    13865
## 3 SCHOOL DISTRICT          1191 NY RT 79       WINDSOR    NY    13865
## 4 SCHOOL DISTRICT      3131 FIVE MILE RD      ALLEGANY    NY    14706
## 5 SCHOOL DISTRICT      3131 FIVE MILE RD      ALLEGANY    NY    14706
## 6 SCHOOL DISTRICT            5873 RT 219 ELLICOTTVILLE    NY    14731
##        LAT       LON ow_ob_pct
## 1 42.16298 -75.87401 0.4615877
## 2 42.06670 -75.64087 0.5272087
## 3 42.06670 -75.64087 0.4801536
## 4 42.13191 -78.50214 0.5473752
## 5 42.13191 -78.50214 0.2810499
## 6 42.25484 -78.65117 0.2362356

The next step of this data wrangling process expands on the first by filtering the table further. In this data frame the STATEWIDE region is not included and the top ten percentile of districts with the most overweight or obese students are not included.

source("../02 Data Wrangling/student_wrangle2.R", echo = TRUE, max.deparse.length=1e3)
## 
## > st2 <- student_weight %>% mutate(ow_ob_pct = cume_dist(NO_OVERWEIGHT_OR_OBESE)) %>% 
## +     filter(REGION != "STATEWIDE (EXCLUDING NYC)", GRADE_LEVEL != 
## +         "DISTRICT TOTAL", ow_ob_pct <= 0.9)
## 
## > head(st2)
##   LOCATION_CODE      COUNTY                               AREA_NAME
## 1         30701      BROOME CHENANGO VALLEY CENTRAL SCHOOL DISTRICT
## 2         31701      BROOME         WINDSOR CENTRAL SCHOOL DISTRICT
## 3         31701      BROOME         WINDSOR CENTRAL SCHOOL DISTRICT
## 4         40302 CATTARAUGUS                  ALLEGANY-LIMESTONE CSD
## 5         40302 CATTARAUGUS                  ALLEGANY-LIMESTONE CSD
## 6         40901 CATTARAUGUS   ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
##             REGION SCHOOL_YEARS NO_OVERWEIGHT PCT_OVERWEIGHT NO_OBESE
## 1  NEW YORK - PENN    2012-2013            42          0.170       58
## 2  NEW YORK - PENN    2012-2013            54          0.139       65
## 3  NEW YORK - PENN    2012-2013            44          0.175       61
## 4 WESTERN NEW YORK    2012-2013            69          0.174       56
## 5 WESTERN NEW YORK    2012-2013            25          0.153       35
## 6 WESTERN NEW YORK    2012-2013            30          0.203       23
##   PCT_OBESE NO_OVERWEIGHT_OR_OBESE PCT_OVERWEIGHT_OR_OBESE GRADE_LEVEL
## 1     0.235                    100                    0.41 MIDDLE/HIGH
## 2     0.167                    119                    0.31  ELEMENTARY
## 3     0.242                    105                    0.42 MIDDLE/HIGH
## 4     0.141                    125                    0.32  ELEMENTARY
## 5     0.215                     60                    0.37 MIDDLE/HIGH
## 6     0.155                     53                    0.36  ELEMENTARY
##         AREA_TYPE         STREET_ADDRESS          CITY STATE ZIP_CODE
## 1 SCHOOL DISTRICT 221 CHENANGO BRIDGE RD    BINGHAMTON    NY    13901
## 2 SCHOOL DISTRICT          1191 NY RT 79       WINDSOR    NY    13865
## 3 SCHOOL DISTRICT          1191 NY RT 79       WINDSOR    NY    13865
## 4 SCHOOL DISTRICT      3131 FIVE MILE RD      ALLEGANY    NY    14706
## 5 SCHOOL DISTRICT      3131 FIVE MILE RD      ALLEGANY    NY    14706
## 6 SCHOOL DISTRICT            5873 RT 219 ELLICOTTVILLE    NY    14731
##        LAT       LON ow_ob_pct
## 1 42.16298 -75.87401 0.4615877
## 2 42.06670 -75.64087 0.5272087
## 3 42.06670 -75.64087 0.4801536
## 4 42.13191 -78.50214 0.5473752
## 5 42.13191 -78.50214 0.2810499
## 6 42.25484 -78.65117 0.2362356

The final wrangling procedure involves filtering out the top ten percentile of ‘Number of Overweight or Obese’, region ‘Statewide’, and grade level ‘District total’. Then we use a group_by function that groups ‘City’ and ‘Region’ together and obtains the mean value of ‘Percent Overweight or Obese’ based on the groupings.

source("../02 Data Wrangling/student_wrangle3.R", echo = TRUE, max.deparse.length=1e3)
## 
## > st3 <- student_weight %>% mutate(pct_cume_dist = cume_dist(PCT_OVERWEIGHT_OR_OBESE)) %>% 
## +     filter(pct_cume_dist <= 0.9, REGION != "STATEWIDE (EXCLUDING NYC)", 
## +         GRADE_LEVEL != "DISTRICT TOTAL") %>% group_by(CITY, REGION) %>% 
## +     mutate(mean = mean(PCT_OVERWEIGHT_OR_OBESE))
## 
## > head(st3)
## Source: local data frame [6 x 21]
## Groups: CITY, REGION
## 
##   LOCATION_CODE      COUNTY                               AREA_NAME
## 1         30701      BROOME CHENANGO VALLEY CENTRAL SCHOOL DISTRICT
## 2         31701      BROOME         WINDSOR CENTRAL SCHOOL DISTRICT
## 3         31701      BROOME         WINDSOR CENTRAL SCHOOL DISTRICT
## 4         40302 CATTARAUGUS                  ALLEGANY-LIMESTONE CSD
## 5         40302 CATTARAUGUS                  ALLEGANY-LIMESTONE CSD
## 6         40901 CATTARAUGUS   ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
## Variables not shown: REGION (fctr), SCHOOL_YEARS (fctr), NO_OVERWEIGHT
##   (dbl), PCT_OVERWEIGHT (dbl), NO_OBESE (dbl), PCT_OBESE (dbl),
##   NO_OVERWEIGHT_OR_OBESE (dbl), PCT_OVERWEIGHT_OR_OBESE (dbl), GRADE_LEVEL
##   (fctr), AREA_TYPE (fctr), STREET_ADDRESS (fctr), CITY (fctr), STATE
##   (fctr), ZIP_CODE (dbl), LAT (dbl), LON (dbl), pct_cume_dist (dbl), mean
##   (dbl)

Making Some Plots:

Now we start plotting out data using the data wrangling procedures performed above. This first graph shows the number of DWI misdemeanor arrests against county population, and then facet-wraps and color codes the data with respect to year. We then apply a linear fit to all of the plots, in order to see the general increasing trend in DWI misdemeanors with increase in county population for every year.

source("../03 Visualizations/Plot1.R", echo = TRUE, max.deparse.length=1e3)
## 
## > wr1 %>% ggplot(aes(x = DWI_MISD, y = POPULATION, color = factor(YEAR))) + 
## +     geom_point() + stat_smooth(method = "lm", se = FALSE) + facet_wrap(~YEAR) + 
## +     ylim(c(0, 150000)) + theme(legend.position = "none", plot.title = element_text(size = 16, 
## +     face = "bold", vjust = 2, family = "Helvetica Neue Light")) + 
## +     labs(x = "Number of DWI Misdemeanor Arrests", y = "County Population") + 
## +     ggtitle("DWI Misdemeanor Arrests vs NY County Population by Year")

Our second plot makes use of the second wrangling procedure by plotting the density distribution of the sum of property, violence, and firearm crime rates. Notice that this plot is done on a semi-log scale along the x-axis, in order to better show the distribution which would otherwise be difficult to see using an ordinar scale. There are different distribution curves pertaining to each year of data, and it is color-coded per year in order to see the time-dependent trend. Overall, we see that this data follows a normal distribution every year. From this, we see that the mean of the sum of all of these crime rates decreases as time increased (there were less crimes as time progressed, on average), but we also see that the variance of these averages decreased with increasing time, as seen by the taller and narrower distribution curves from recent years.

source("../03 Visualizations/Plot2.R", echo = TRUE, max.deparse.length=1e3)
## 
## > wr2 %>% ggplot(aes(x = prop_vio_fire, color = YEAR, 
## +     group = YEAR)) + geom_density(adjust = 3) + scale_x_log10() + 
## +     scale_color_continuous(low = "#656565", high = "#FF6200", 
## +         name = "Year") + labs(x = "log( Combined Property, Firearm, and Violence Crime Rate )", 
## +     y = "Density") + theme(plot.title = element_text(size = 20, 
## +     face = "bold", vjust = 2, family = "Helvetica Neue Light")) + 
## +     ggtitle("New York Property, Firearm, and Violence\nCrime Rate Density Distributions Per Year")

The third plot obviously is used alongside the most complicated wrangling procedure. With this graph we plot all of the melted averages with respect to year on one plot as points. We then plot a fitted curve line with a gray error width, as well as a dotted linear fitted line. As one can see, we make use of the ggplot functions that make our plots more aesthetically pleasing including things like graph titles, unique fonts and color coded regression lines. The linear fitted lines help us see that, despite the fluctuations seen in the data, all of the crime types seem to decrease with time. This may be an indication that New York keeps getting more safe every year, with the decreasing trends in almost every column.

source("../03 Visualizations/Plot3.R", echo = TRUE, max.deparse.length=1e3)
## 
## > wr3 %>% ggplot(aes(x = YEAR, y = value, color = variable)) + 
## +     geom_point(alpha = 0.5) + stat_smooth(alpha = 0.15) + scale_y_log10() + 
## +     theme(panel.background = element_rect(fill = "white"), panel.grid.major = element_line(colour = "#B2B2B2", 
## +         size = 0.25), panel.grid.minor = element_line(colour = "#B2B2B2", 
## +         size = 0.25)) + stat_smooth(method = "lm", se = FALSE, 
## +     linetype = "dotted") + labs(x = "Year", y = "log( Mean Values )") + 
## +     theme(plot.title = element_text(size = 20, face = "bold", 
## +         vjust = 2, family = "Helvetica Neue Light")) + ggtitle("New York Crime Averages Per Year") + 
## +     scale_color_discrete(name = "Crime Type")
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

Next we plot a geometric point graph that shows the relationship between the number of obese children and number of overweight children in both elementary and middle/high schools.

source("../03 Visualizations/plot1student.R", echo = TRUE, max.deparse.length=1e3)
## 
## > st1 %>% ggplot(aes(x = NO_OVERWEIGHT, y = NO_OBESE, 
## +     color = GRADE_LEVEL)) + geom_point() + facet_wrap(~REGION) + 
## +     stat_smooth(method = "lm", se = FALSE) + labs(x = "Number Overweight", 
## +     y = "Number Obese")

It’s interesting to see that elementary schools, although on average having less of each, have more more “outliers” in the data set. This could be attributed to the restirictions of school sports in elemenary schools or some other unobservable factor(s).

source("../03 Visualizations/plot3student.R", echo = TRUE, max.deparse.length=1e3)
## 
## > st3 %>% ggplot(aes(x = LON, y = LAT, color = mean)) + 
## +     geom_point() + coord_equal() + scale_color_gradient(high = "red", 
## +     low = "green", name = "Mean percentage\nobese or overweight") + 
## +     facet_wrap(~SCHOOL_YEARS) + labs(x = "Longitude", y = "Latitude")
## Warning: Removed 14 rows containing missing values (geom_point).

Plotting the individal lat and long coordinates on point graph we are able to see that the data set was taken from all over the state of New York in a relatively thourough study. The outline of each graph is the obvious form of the state itself. Furthermore we are able to deduce from the graph(s) 2 things.

  1. Although being overweight or obese was generally seen in all parts of the state there was still a correlation with higher percentages( >.4) in the norther half of New York

  2. Between 2012 and 2013 the states’ school districts saw a significant decrease in the amount of overweight or obese students.

Using Tableau

Importing data to Tableau

We begin by importing the data to Tableau. Since both of our data tables were previously stored in an Oracle database during an earlier project, we use Tableau’s “Connect to Data” wizard by selecting Oracle in the sidebar. Our data can be accessed by using the following credentials:

Server: 129.152.144.84
Service: ORCL.usuniversi01134.oraclecloud.internal
Port: 1521
Username: C##cs329e_eg22453
Password: orcl_eg22453

We do this process two separate times under the same workbook, in order to import the two tables separately and avoid “joining” them.

Adding Relationships

Once our data is loaded into Tableau, we must add relationships to the data tables to ensure Tableau is able to use them properly. To do this, we go to Data >> Edit Relationships... and add the appropriate relationship that links the data together. For our project, our data was joined by the County and Year columns, as shown below.

Data Blending

Blend 1

We are now ready to create some interesting visualizations. For our first image, we wanted to see how drug and violence-relatated sentences correlated with drug felonies. We did this using a simple line graph, plotting year on the x-axis and the two measures of interest (Drug Sentences, Violent Felony Sentences) on the y-axis. In order to make use of the second data table, we adjusted the size and color of the graphs according to the number of drug arrests. It is clear from the graph colors and the upper graph, that with an increase in drug arrests and sentences, there was a definite spike in the number of violent sentences.

Blend 2

For our second graph, we wanted to see how the ratio of sentences to arrests varied for different types of crimes. In order to do this, we created several calculated fields for drug, DWI, and property crimes. These calculated fields were simply a ratio of the sum of the sentences to the sum of the arrests for each specific crime type, as shown below for the case of drugs.

With these three calculated ratios (drugs, DWI, and property), we then created plots with these the ratios (x-axis) against the total number of sentences (y-axis) across all crime types, to see if the ratio would change. Surprisingly, we found that the ratio of sentences to arrests was fairly consistent within 0 to 0.5 for these three crime types, regardless if the data point was from a high sentence count (5K), or a low one (50). In order to spice up the data, we colored the points according to the sum of the total number of arrests.

Blend 3

For our third plot, we decided to visualize crime parameters for all of the different counties in New York. We decided to make a bubble plot, where each county was colored and labeled distinctly. We then sized each bubble according to the total number of arrests in that county for all types of crime. Because each county was given a bubble for each year, coloring by county was useful to see which counties had consistently high (or consistently low) number of arrests across years. It can be seen that some counties, such as Bronx and Kings, were the ones with consistently high number of arrests, whereas some counties are so small that their labels did not even show up.

Blend 4

For our final plot, we made another calculated field, this time to calculate the ratio of the number of drug felonies to the number of drug misdemeanors, for each county. The calculation was done very similar to the ones under Blend 2, with just a simple division of two sums. We plotted this ratio for all of the different counties using a bar graph, and colored the bars according to their value. If the ratio was over 1, the bars would be red, and anything else would be green. In addition, we adjusted the width of the bars according to the total number of crime sentences for that respective county. Surprisingly, we found that the counties with red bars (that is, more felonies than misdemeanors) were also counties with thin bars (meaning a low number of crime sentences). This seems a little counterintuitive, but there are some exceptions to every case.

More Data Sources

In order to spice up our data even further, we decided to incorporate data from other databases to our current plots. We added a database on the number of home delivered meals to elderly people in NY, as well as statistics on foster care in NY. The following two plots illustrate this data being blended with some of our previous data sources.

The plot shown below illustrates the number of meals served per county per year for all of the counties in New York. That is, every county gets their own “bubble” for each year, and the bubbles are color coded by counties. It is clear from the data that although New York City is not the most populous county, it delivers the most meals to elderly people year after year.

This next plot show the relationship between different crime counts and the number of children in foster care across different counties for different years. Although there is much spread in the data, it can be seen that there is a rought linear relationship between the different crimes and the children in foster care.